Introduction

Welcome to our Natural Language Processing tutorial!

High-level overview:

  • Create an environment to use R AND python in the SAME RMarkdown!

  • Access the Genius API to easily create a custom dataset

  • Overview of Natural Language Processing

  • Text Pre-processing and formatting

  • Sentiment analysis

  • Topic modeling (bag of words): Latent Dirichlet Allocation

  • BERTopic advanced models

#install.packages('reticulate')
#install.packages('dotenv')
library("reticulate")     # Incorporates python code
#install_miniconda()
library('dotenv') # Uses .env files to hide sensitive information; i.e. access codes

More info about using .env files: https://medium.com/towards-data-science/using-dotenv-to-hide-sensitive-information-in-r-8b878fa72020

Configure Environment

  • For the first part of this tutorial we use the reticulate package to set up a special environment into which we can install python packages. This allows us to use both R AND python code within the same RMarkdown document.

  • It’s also possible to incorporate python classes into RMarkdown: http://theautomatic.net/2020/01/14/how-to-import-python-classes-into-r/

The following descriptions are from the reticulate github repo: https://rstudio.github.io/reticulate/

  • The reticulate package includes a Python engine for R Markdown with the following features:
  • Run Python chunks in a single Python session embedded within your R session (shared variables/state between Python chunks)
  • Printing of Python output, including graphical output from matplotlib
  • Access to objects created within Python chunks from R using the py object (e.g. py$x would access an x variable created within Python from R)
  • Access to objects created within R chunks from Python using the r object (e.g. r.x would access an x variable created within R from Python)
  • Using virtualenvs is supported on Linux and Mac OS X; using Conda environments is supported on all platforms including Windows

Create a conda environment to load python packages

  • conda_list() List all available conda environments
  • conda_create() Create a new conda environment
  • conda_install() Install a package within a conda environment
  • conda_remove() Remove individual packages or an entire conda environment

Disclaimer: The code below using reticulate ‘should’ run without much issue on a Windows PC, but Mac OS and Linux may present unforeseen difficulties. Check dependencies, etc.; good luck!

# Initial installation code to the environment
# This chunk should take a couple minutes to install

# conda_create('r-reticulate')
# 
# # Use pip=T for non-conda packages
# conda_install('r-reticulate',"scikit-learn")
# conda_install('r-reticulate',"lyricsgenius", pip=T)   
# conda_install('r-reticulate',"contractions", pip=T)
# conda_install('r-reticulate',"nltk")
# conda_install('r-reticulate',"numpy")
# conda_install('r-reticulate',"pandas")
# conda_install('r-reticulate','gensim')
# conda_install('r-reticulate','python-flair')
# conda_install('r-reticulate', 'BERTopic')
# conda_install('r-reticulate', 'plotly')

Note: Install scikit-learn but then import sklearn

use_condaenv("r-reticulate") # Loads a pre-existing conda environment

#Imports python packages into the environment
import('sklearn')         # Comprehensive machine learning toolkit
## Module(sklearn)
import('lyricsgenius')    # access the Genius API
## Module(lyricsgenius)
import("nltk")            # Natural language toolkit
## Module(nltk)
import('contractions')    # expands contractions 
## Module(contractions)
import('gensim')          # Topic modeling for language 
## Module(gensim)
import('flair')           # State-of-the-Art NLP techniques
## Module(flair)
import('bertopic')        # Advanced topic modeling
## Module(bertopic)
import('numpy')           # duh
## Module(numpy)
import('pandas')          # duh
## Module(pandas)
import('plotly')          # Interactive visualization
## Module(plotly)

Creating Music Datasets

Using the lyricsgenius package we can easily access the Genius API, and with a couple of functions provided below we have the freedom to create datasets of the Genius top charts with one line of code.

Note: Additional functions and instructions provided at the end of the tutorial

Some setup code

path <- getwd()       # current working directory
load_dot_env("tokens.env")      # Access hidden .env file
client_access_token <- Sys.getenv("client_access_token") # get access code

Note: If you get “Warning: incomplete final line found on ‘tokens.env’”, try hitting enter at the end of your .env file

Switching to python code

# ^^^ observe
import sklearn
import lyricsgenius
import nltk
import pandas as pd
import numpy as np
# Able to directly load base python packages  
import os

genius = lyricsgenius.Genius(r.client_access_token) # Genius API agent

Functions to Create Datasets

def top_charts(time_period='all_time',genre='all',
                    n_per_page=50,type_='songs',pages=1):
  
#  Purpose: Access the Genius top charts API to create a dataset
#    - Results vary, but size is less than 200
#    - Used in conjunction with the song_info function
#  Input:
#    - time_period: 'day', 'week', 'month' or 'all_time'
#    - genre: 'all', 'rap', 'pop', 'rb', 'rock' or 'country'
#    - n_per_page: 1 - 50
#    - type_: item type: 'songs', 'albums', or 'artists'
#    - pages: number of pages to retrieve
# Output: 
#   dataframe of song ids, titles, artists, and lyrics


  song_ids = list() # Lists to add to output data frame 
 
  # while-try loop because the request sometimes times out and would otherwise kill the loop
  for pn in range(1,pages+1):
      t = True
      while t == True:
          try:
              songs = genius.charts(page=pn,time_period=time_period,
                              chart_genre=genre,per_page=n_per_page,type_=type_)
          except:    
              pass
          else:
              t = False
              n = len(songs['chart_items'])  # number of hits
              # get song ids
              for song in range(0,n):
                  language = songs['chart_items'][song]['item']['language']
                  if language == 'en':
                      song_id = songs['chart_items'][song]['item']['api_path'].replace('/songs/','')
                      song_ids.append(song_id)
                      
  # call song_info function to retrieve lyrics
  topchart_df = song_info(song_ids) 
  
  # option to save the dataframe as a csv file
  path = os.getcwd()
  csv_name = 'topchart_'+time_period+'_'+type_+'_'+genre+'.csv'
  #topchart_df.to_csv(path+'/'+csv_name, index=False)
  
  return topchart_df
def song_info(ids,time_period='all_time',genre='all',type_='songs'):
  
  # Input: list of song ids
  # Output: dataframe with song ids, artist names, lyrics, song titles.

  if not isinstance(ids, list):
      print('input needs to be a list')
  lyrics = list()         
  titles = list()         
  artists = list()
  bad_song_ids = list()
  song_ids = ids
  
  # Access Genius API for each song_id
  # The try/except/pass code protects the dataset creation from being
  # terminated if any single API call fails, since the Genius API
  # sometimes times out
  for song in song_ids:
    t = True
    while t == True:
        try:
            a = genius.search_song(song_id=song)
        except:    
            pass
        else:
            t = False
            if (a is not None) and (a.to_text() is not None):
                lyrics.append(a.to_text())
                titles.append(a.title)
                artists.append(a.artist)
            else:
                bad_song_ids.append(song)
                
  # Eliminate corrupt songs
  song_ids = [x for x in song_ids if x not in bad_song_ids] 
  
  # output data frame                                   
  song_df = pd.DataFrame({
      'title': titles,
      'lyrics': lyrics,
      'artist': artists,
      'song_ids': song_ids,})
  
  return song_df

Example Dataset Creation Code

# dataset1 = top_charts(genre='country',n_per_page=50, pages=1, 
#        type_='songs',time_period='all_time')

Dataset options:

  • genre: rap, rock, pop, rb (r&b), country, or all
  • type: songs, albums, or artists
  • time period: day, week, month, or all_time

Note: Repeatedly calling the Genius API will (rarely) cause the automated functions to terminate, even with the try/except/pass code, but retrying once or twice should succeed.

Caveat: Some songs may appear twice as different versions (remix, acoustic). We decided not to remove duplicates, because both versions appearing in the top charts shows just how popular the song is.
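If you did want to drop duplicates, pandas makes it a one-liner; a minimal sketch on toy data using the same column names as the `song_info` output:

```python
import pandas as pd

# Toy frame where one song appears twice (e.g. a remix under the same title)
df = pd.DataFrame({
    'title':  ['Song A', 'Song A', 'Song B'],
    'artist': ['X', 'X', 'Y'],
    'song_ids': [1, 2, 3],
})

# Keep the first occurrence of each title/artist pair
deduped = df.drop_duplicates(subset=['title', 'artist'], keep='first')
print(len(deduped))  # 2
```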

Load previously created datasets

Here I load datasets that were previously created with the same functions defined above.

path = os.getcwd()

path_rap = path+'/topchart_all_time_songs_rap.csv'
rap_df = pd.read_csv(path_rap, index_col=False)

path_rock = path+'/topchart_all_time_songs_rock.csv'
rock_df = pd.read_csv(path_rock, index_col=False)

path_pop = path+'/topchart_all_time_songs_pop.csv'
pop_df = pd.read_csv(path_pop, index_col=False)

path_rb = path+'/topchart_all_time_songs_rb.csv'
rb_df = pd.read_csv(path_rb, index_col=False)

path_country = path+'/topchart_all_time_songs_country.csv'
country_df = pd.read_csv(path_country, index_col=False)

dfs = [rap_df,rock_df,  pop_df, rb_df, country_df]
full_df = pd.concat(dfs)

Pre-processing

Pre-processing Description

Before we can analyze our data, we must pre-process our text, as one would do for any Natural Language Processing. The main pre-processing steps we will focus on are data cleaning, tokenization, removing stop words, and normalization/lemmatization.

The first step is to clean our data of any unintended characters or symbols present in our text. As our data comes from using the package ‘lyricgenius’ to access the Genius API, there are many characters present in our text that we would wish to have removed. We can use regular expressions to search for these characters or groups of characters and remove them from our data.

Our next step is to tokenize our data. In NLP tokenizing is the process of turning unstructured data into discrete, usable units that we will use for our natural language processing. There are various forms of tokenizing, but we will use word tokenizing for our problem. This will split up our data, in this case the lyrics to each song, into separate units for each word in the lyrics. This unit, word, is what the natural language processing will use to analyze (as opposed to using words + common phrases or advanced modeling techniques that consider the ordered sequence of words like transformer models. Another common tokenizer used is sentence tokenization, which splits up the data by sentence.
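The idea of word tokenization can be sketched without nltk; below is a crude tokenizer built from a regular expression (nltk's word_tokenize handles punctuation and edge cases far more carefully):

```python
import re

text = "We don't talk anymore, like we used to do."

# Keep runs of letters/apostrophes; punctuation is discarded
tokens = re.findall(r"[A-Za-z']+", text)
print(tokens)
# ['We', "don't", 'talk', 'anymore', 'like', 'we', 'used', 'to', 'do']
```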

Once our data is tokenized, we can remove stop words from our data. Stop words are common words that are not useful in NLP. Think words like ‘the’, ‘and’, or ‘an’. We remove these words before processing because they would serve little value in helping us classify our songs, and could even have significant negative impacts on our analysis by diluting the relevant words in data. We use the nltk library’s default stopwords, as well as a few more that are common in song lyrics. Additionally, we have included explicit words as stopwords, as we would rather not use these words to classify our genres.
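A minimal sketch of the stop-word filtering idea, using a hand-picked stop list (nltk's default English list is far larger, and the tutorial extends it with song-specific and explicit words):

```python
# Toy stop-word set; nltk's stopwords.words('english') provides ~180 words
stop_words = {'the', 'and', 'an', 'a', 'to', 'of'}

tokens = ['the', 'night', 'and', 'the', 'city', 'of', 'lights']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['night', 'city', 'lights']
```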

Finally, we must normalize our text in some way. The two most common forms of text normalization are stemming and lemmatization. Stemming is a “heuristic” based approach that removes common ends to words, leaving just the “stems” remaining, e.g. boats, boatness -> boat. Lemmatizing, on the other hand, applies a “morphological” analysis of each word to determine the base, or “lemma”, of the word, e.g. mice -> mouse. Lemmatization is typically preferred to stemming, as it provides a more complex analysis of the true meaning of each word, and so we will use it in our analysis.
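The difference is easy to see with a toy suffix-stripping stemmer; it is purely heuristic, so note what it does to 'running' and 'mice' (a lemmatizer backed by a dictionary such as WordNet would return 'run' and 'mouse'):

```python
def toy_stem(word, suffixes=('ness', 'ing', 'es', 's')):
    # Heuristic stemming: strip the first matching suffix, no dictionary involved
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

print([toy_stem(w) for w in ['boats', 'boatness', 'running', 'mice']])
# ['boat', 'boat', 'runn', 'mice']
```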

Pre-processing Code

import re             # python regular expressions
import copy
import contractions
import nltk
#nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
def process_text(df):
  
  # Performs various text pre-processing steps
  # Input: Data frame with a column named 'lyrics'
  # Output: Same dataframe but with pre-processed text
  
  df1 = copy.deepcopy(df)
  lyrics = df1.lyrics
  lyrics_final = list()
  
  snow_stemmer = SnowballStemmer(language='english')
  wnl = WordNetLemmatizer()
  
  for lyric in lyrics:
    # Removes brackets and text inside
    song_lyrics = re.sub(r'\[.*?\]', '', lyric)      
    # Removes parentheses and text inside
    song_lyrics = re.sub(r'\(.*?\)', '',song_lyrics)    
    # Finds start of lyrics
    song_lyrics = song_lyrics[song_lyrics.find('Lyrics')+6:] 
    # Removes newline chars (\n)
    song_lyrics = re.sub("\n"," ",song_lyrics)          
    # Swaps apostrophes for a placeholder...
    song_lyrics = re.sub('\'', "plac3h0ler",song_lyrics)    
    # ...then restores them, so apostrophes survive intact
    song_lyrics = re.sub('plac3h0ler', r"'",song_lyrics)  
    # Removes the trailing 'Embed' text at the end of the doc
    song_lyrics = re.sub(".{3}Embed", "",song_lyrics)         
    # Expands contractions to their full form
    song_lyrics = contractions.fix(song_lyrics)               
    # Removes punctuation
    song_lyrics = re.sub(r'[^\w\s]','',song_lyrics)   
    # Removes numbers and any other non-letter characters
    song_lyrics = re.sub("[^a-zA-Z]+", " ",song_lyrics)  
    # Tokenize words
    word_tokens = word_tokenize(song_lyrics)
    # Lemmatize words
    lemma_words_tokens = [wnl.lemmatize(token) for token in word_tokens]    
    
    # stopwords 
    stop_words = stopwords.words('english')  
    sw = ['ayy', 'like', 'come', 'yeah', 'got', 'la', 'ya',
          'oh', 'ooh', 'huh', 'whooaaaaa', 'o', 'n', 'x']
    explict_words = ['nigga', 'nigger', 'bitch', 'bitchin', 'fag', 'faggot',
                     'fuck', 'fucked', 'fuckin', 'motherfucker', 'motherfuckin',
                     'pussy', 'dick', 'cock', 'whore','shit', 'shittin']
    stop_words_final = stop_words + sw + explict_words
    
    # Remove stopwords
    filtered_lyrics = [token.lower() for token in lemma_words_tokens if 
              token.lower() not in stop_words_final] 
    
    # Join lyrics into one string
    lyrics_joined = ' '.join(filtered_lyrics).lower()
    
    lyrics_final.append(lyrics_joined)
  
  df1 = df1.drop(['lyrics'], axis=1)
  df1['lyrics'] = lyrics_final
  
  return df1

Apply pre-processing to datasets

cleaned_full_df = process_text(full_df)
cleaned_rap_df = process_text(rap_df)
cleaned_rock_df = process_text(rock_df)
cleaned_pop_df = process_text(pop_df)
cleaned_rb_df = process_text(rb_df)
cleaned_country_df = process_text(country_df)
pd.set_option('display.max_columns', None)
print(cleaned_full_df.head())
##             title          artist  song_ids  Unnamed: 4  \
## 0         Rap God          Eminem    235729         NaN   
## 1             WAP         Cardi B   5832126         NaN   
## 2         HUMBLE.  Kendrick Lamar   3039923         NaN   
## 3  Bad and Boujee           Migos   2845980         NaN   
## 4      SICKO MODE    Travis Scott   3876994         NaN   
## 
##                                               lyrics  
## 0  look wa going go easy hurt feeling going get o...  
## 1  whores house house house house said certified ...  
## 2  nobody pray day way remember syrup sandwich cr...  
## 3  know young rich know something really never ol...  
## 4  astro sun freezin cold already know winter daw...

Text Analysis

Now that we have finished the ‘setup’ part, we can finally get into the fun stuff.

Natural language processing seeks to translate human language, like text or speech, to comprehensible and analyzable pieces for learning machines. NLP has common applications such as speech recognition, topic extraction, name-entity recognition, and sentiment analysis. The abundance of text data readily available has made natural language processing a growing field.

We will take a look at 2 main uses for natural language processing: Sentiment Analysis and Topic Modeling. Sentiment Analysis is used to determine the emotional sentiment around text, typically used to analyze reviews. Topic Modeling is an unsupervised machine learning technique aimed at classifying different documents into topics based on the words within each document. Both of these techniques have applications to our lyrics data, as we will be able to identify the sentiment of songs, as well as find possible topics among the songs.

Sentiment Analysis

Sentiment Analysis Description

Sentiment Analysis is a multinomial text classification task in which the emotional weight of a text (positive, neutral, negative) is calculated using natural language processing. Sentiment analysis has many applications, especially in analyzing reviews, surveys, and media. There are two major types of sentiment analysis: rule-based and embedding-based.

Rule-based analysis is the simpler approach: it does not leverage machine learning, and bases its calculations on known datasets of words. This means rule-based sentiment analysis can flag songs as negative when they use a common word like “sad”, but terms unfamiliar to a rule-based library are simply ignored and cannot be scored. It is also unable to understand the context in which words are used, meaning that homonyms (often pop-culture homonyms, in the context of song lyrics) can only be interpreted one way.
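A toy rule-based scorer makes the limitation concrete; the mini-lexicon here is invented for illustration (real rule-based tools such as VADER ship curated lexicons plus extra rules for negation and intensifiers):

```python
# Hypothetical mini-lexicon of word sentiment weights
lexicon = {'sad': -1.0, 'happy': 1.0, 'love': 1.5, 'hate': -1.5}

def rule_based_score(tokens):
    # Sum the known word weights; unknown words contribute nothing
    return sum(lexicon.get(t, 0.0) for t in tokens)

print(rule_based_score(['i', 'love', 'this', 'sad', 'song']))  # 0.5
```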

Embedding-based sentiment analysis, on the other hand, forms vector representations of words, where similar words are dimensionally similar. These vector representations can also be combined arithmetically to represent word relationships, e.g. king − man + woman ≈ queen. More info: https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair
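The vector arithmetic can be sketched with invented 3-dimensional “embeddings” and cosine similarity; real embeddings are learned from data and have hundreds of dimensions:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-d embeddings; dimensions loosely "royalty", "male", "female"
king, man, woman = [0.9, 0.8, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9]
queen = [0.9, 0.1, 0.8]

# king - man + woman should land near queen
composed = [k - m + w for k, m, w in zip(king, man, woman)]
print(round(cosine(composed, queen), 3))  # 0.995
```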

For our analysis, we will be using the Flair library, and their pre-trained sentiment analysis on an IMDB dataset. This library is especially advanced due to the type of embedding sentiment analysis it uses. Flair uses contextual string embeddings to determine the sentiment of words. This treats words as characters, and uses character language models as well as the embeddings of the surrounding text/characters to determine the word’s embedding. In practice, this means that words can be given different embeddings, or meanings, depending on the context. In our example, this means that words can be seen as positive or negative depending on the surrounding lyrics.

It should be noted that this sentiment analysis model is trained on an IMDB dataset. For that reason, it may not produce the most accurate analysis of our lyrics. To take this method one step further, if you have a large enough dataset you can use this package to train your own sentiment analysis model, tailored to your specific problem.

Copied from the flair github repo: https://github.com/flairNLP/flair

A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, with support for a rapidly growing number of languages.

A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings.

A PyTorch NLP framework. Our framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.

Sentiment Analysis Code

from flair.models import TextClassifier
from flair.data import Sentence

# This is the pre-built model and will take awhile to download
classifier = TextClassifier.load('en-sentiment') 
## 2022-12-14 14:39:49,610 loading file /Users/Brennan/.flair/models/sentiment-en-mix-distillbert_4.pt
def sentiment(df):
  
  # Performs sentiment analysis on a text data
  # Input: dataframe with a column named 'lyrics' for text data
  # Output: 
  #       Original dataframe with sentiment scores of individual songs (dataframe)
  #       overall sentiment score for the dataset, summed across songs (float)
  
  return_df = df
  song_scores = []
  lyrics = df.lyrics.tolist()
  # Sums each song's sentiment score to get an overall dataset sentiment score
  sum_ = 0  
  
  for lyric in lyrics:
    sentence = Sentence(lyric)
    classifier.predict(sentence)
    text = str(sentence.labels)
    song_scores.append(text.split('/')[1])
    score = text.split('/')[1]
    num = float(score.split('(')[1].split(')')[0])
    if score.__contains__("NEGATIVE"):
      num = num * -1
    sum_ += num
    
  return_df['Sentiment_score'] = pd.Series(song_scores)
  
  return return_df, sum_
sent_df_rap, avg_scr_rap = sentiment(cleaned_rap_df)
print('rap genre: ', round(avg_scr_rap, 4), '\n\n' 'Song scores: ', sent_df_rap.head(10))
## rap genre:  -49.7011 
## 
## Song scores:              title          artist  song_ids  \
## 0         Rap God          Eminem    235729   
## 1             WAP         Cardi B   5832126   
## 2         HUMBLE.  Kendrick Lamar   3039923   
## 3  Bad and Boujee           Migos   2845980   
## 4      SICKO MODE    Travis Scott   3876994   
## 5      God’s Plan           Drake   3315890   
## 6   Man’s Not Hot        Big Shaq   3244990   
## 7   XO TOUR Llif3    Lil Uzi Vert   3003630   
## 8  1-800-273-8255           Logic   3050777   
## 9    Bodak Yellow         Cardi B   3095483   
## 
##                                               lyrics       Sentiment_score  
## 0  look wa going go easy hurt feeling going get o...  'NEGATIVE' (0.9792)]  
## 1  whores house house house house said certified ...  'POSITIVE' (0.9937)]  
## 2  nobody pray day way remember syrup sandwich cr...  'NEGATIVE' (0.9863)]  
## 3  know young rich know something really never ol...  'NEGATIVE' (0.7764)]  
## 4  astro sun freezin cold already know winter daw...  'POSITIVE' (0.8489)]  
## 5  wishin wishin wishin wishin wishin movin calm ...  'NEGATIVE' (0.8825)]  
## 6  yo big shaq one mans hot never hot skrrat skid...  'POSITIVE' (0.9182)]  
## 7  alright alright quite alright money right coun...   'NEGATIVE' (0.992)]  
## 8  low taking time feel mind feel life mine low t...  'NEGATIVE' (0.9818)]  
## 9  ksr cardi said wanted dance said lil wanted ex...  'NEGATIVE' (0.5963)]

Only showing one output

sent_df_rock, avg_scr_rock = sentiment(cleaned_rock_df)
print('rock genre: ', round(avg_scr_rock, 4), '\n\n' 'Song scores: ', sent_df_rock.head(10))
sent_df_pop, avg_scr_pop = sentiment(cleaned_pop_df)
print('pop genre: ', round(avg_scr_pop, 4), '\n\n' 'Song scores: ', sent_df_pop.head(10))
sent_df_rb, avg_scr_rb = sentiment(cleaned_rb_df)
print('rb genre: ', round(avg_scr_rb, 4), '\n\n' 'Song scores: ', sent_df_rb.head(10))
sent_df_country, avg_scr_country = sentiment(cleaned_country_df)
print('country genre: ', round(avg_scr_country, 4), '\n\n' 'Song scores: ', sent_df_country.head(10))
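The label parsing inside the `sentiment` function above can be illustrated without flair, assuming a label stringifies like `[NEGATIVE (0.9792)]`:

```python
def parse_label(text):
    # Extract the numeric confidence between the parentheses,
    # then sign it by the predicted class
    num = float(text.split('(')[1].split(')')[0])
    return -num if 'NEGATIVE' in text else num

print(parse_label('[NEGATIVE (0.9792)]'))  # -0.9792
print(parse_label('[POSITIVE (0.9937)]'))  # 0.9937
```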

Topic Modeling

Latent Dirichlet Allocation (LDA)

LDA Description

Latent Dirichlet Allocation is an approach to topic modeling. The goal of LDA is to discover hidden, or latent, topics within a set of documents containing text. In the context of our example, documents represent songs and text represents the lyrics. Assume we have K latent topics we hope to discover. In LDA, each document can be viewed as a multinomial distribution over the K latent topics, where the distribution gives the probability of that document belonging to each latent topic. Each latent topic, in turn, can be viewed as a distribution over words, giving the probability of each word being used in that topic. These two distributions can be estimated through an iterative process that groups suitable words and documents together, creating topics characterized by words, and documents given suitable topics based on those words.

LDA starts by assigning random topics to each word in each document. Then, the algorithm selects one word to update its topic classification. With the aforementioned distributions, the algorithm calculates the probability of each topic given the document (found by taking the counts of all the topics for all the other words in the document) and the probability of each word for each topic (found by taking the counts of each word with each topic across all documents) and multiplies them to find the probability that each topic generated that word. We then pick the most likely topic and assign the word that new topic. This process is repeated for all words in all documents, and then iterated over to reach a steady state of latent topics.

Through each iteration, topic classifications are made based upon how well a word fits a topic, and how well that topic fits the document. Because initial assignments are random, our topic distributions will do a poor job at assigning words new topics, but eventually suitable words and topics will be paired together through the topic assignments. This will create a topic classification distribution for every document, as well as generating sets of common words for topics.
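The assign-remove-resample procedure described above is a collapsed Gibbs sampler; below is a toy version on four tiny “songs”. (Note that gensim's LdaModel, used later, estimates the model with variational inference rather than Gibbs sampling, but the count-based intuition is the same.)

```python
import random
from collections import defaultdict

random.seed(1)

# Four tiny "songs" as tokenized lyrics
docs = [['love', 'heart', 'love', 'baby'],
        ['money', 'car', 'money', 'gold'],
        ['heart', 'baby', 'love'],
        ['gold', 'car', 'money']]
K = 2                                   # number of latent topics
alpha, eta = 0.1, 0.1                   # Dirichlet smoothing priors
n_vocab = len({w for doc in docs for w in doc})

# Step 1: assign a random topic to every word occurrence
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

# Count tables: topics per document, words per topic, topic totals
doc_topic = [[0] * K for _ in docs]
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = assignments[d][i]
        doc_topic[d][k] += 1
        topic_word[k][w] += 1
        topic_total[k] += 1

# Step 2: repeatedly resample each word's topic
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            # Remove this word's current assignment from the counts
            doc_topic[d][k] -= 1; topic_word[k][w] -= 1; topic_total[k] -= 1
            # P(topic | document) * P(word | topic), for each topic
            weights = [(doc_topic[d][t] + alpha) *
                       (topic_word[t][w] + eta) / (topic_total[t] + eta * n_vocab)
                       for t in range(K)]
            # Sample a new topic in proportion to those weights
            k = random.choices(range(K), weights=weights)[0]
            assignments[d][i] = k
            doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1

# Most frequent words per discovered topic
for t in range(K):
    print(t, sorted(topic_word[t], key=topic_word[t].get, reverse=True)[:3])
```

With these four documents the sampler tends to separate the love/heart/baby words from the money/car/gold words into the two topics.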

Below is a graphical model representing the Latent Dirichlet Allocation we are performing. Without getting into all of the details, you can see which variable is which and how each part is estimated:

  • Θ: topic mixes for each document
  • Z: topic assignment for each word
  • W: word in each document
  • β: distribution of words for each topic
  • N: number of words
  • M: number of documents
  • 𝜶: distribution of topics in documents
  • 𝜂: distribution of words in topics

LDA Code

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

LDA coherence: https://rare-technologies.com/what-is-topic-coherence/

def LDA_topics(df = cleaned_full_df,n_topics=10,top_n_words=15):
  
  # Fits an LDA model to a dataframe with a 'lyrics' column and returns a
  # dataframe of the top_n_words words for each of n_topics topics
  
  # Split song lyrics into individual strings
  sep_lyrics_list = []
  for song in df.lyrics:
    sep_lyrics = song.split()
    sep_lyrics_list.append(sep_lyrics)
  
  # Bigram model --  two words frequently occurring together in a song
  bigram_init = gensim.models.Phrases(sep_lyrics_list)
  bigram_model = gensim.models.phrases.Phraser(bigram_init)
  lyrics_bigrams = [bigram_model[lyric] for lyric in sep_lyrics_list]
  
  # Create corpus dictionary
  id2word = corpora.Dictionary(lyrics_bigrams)
  
  # Term frequency
  corpus = [id2word.doc2bow(bigram) for bigram in lyrics_bigrams]
  
  # LDA model
  lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                          id2word=id2word,
                                          num_topics=n_topics, 
                                          random_state=1,
                                          update_every=1,
                                          chunksize=100,
                                          passes=20,
                                          alpha='auto',
                                          per_word_topics=True)
  
  # Perplexity: measure of model performance (the lower the value the better the performance)
  print('Perplexity: ', lda_model.log_perplexity(corpus))
  coherence_model_lda = CoherenceModel(model=lda_model, 
                                       texts=lyrics_bigrams, 
                                       dictionary=id2word, 
                                       coherence='c_v')
  
  # Coherence: measure of interpretability
  coherence_lda = coherence_model_lda.get_coherence()
  print('Coherence Score: ', coherence_lda)
  
  topic_nums = []
  words = []
  
  # Create list of topics and nested list of words for data frame
  for index, topic in lda_model.show_topics(formatted=False, 
                                        num_words=top_n_words,
                                        num_topics=n_topics):
    topic_nums.append(index)
    words.append([word[0] for word in topic])
  
  # Initial data frame -- formatting needed
  init_df = pd.DataFrame({'Topic':topic_nums,
                    'Words': words})
  
  # Split the 'Words' column into top_n_words separate columns
  split_words_df = pd.DataFrame(init_df['Words'].to_list(), 
                            columns=['Word ' + str(i) for i in range(top_n_words)])
  split_words_df['Topic'] = topic_nums
  
  # Reorder the columns (Topic first)
  cols = split_words_df.columns.tolist()
  cols = cols[-1:] + cols[:-1]
  
  # Final LDA dataframe
  LDA_df = split_words_df[cols]
  
  return LDA_df
LDA_topics(cleaned_full_df)
## Perplexity:  -8.163395244557206
## Coherence Score:  0.35423024601321873
##    Topic      Word 0         Word 1         Word 2                Word 3  \
## 0      0          wa  remember_well          night                  keep   
## 1      1    run_away             go            let           look_around   
## 2      2        know           love           want                   get   
## 3      3       crazy    inside_head  switchin_side               nothing   
## 4      4        know            gon            get                  girl   
## 5      5     bam_bam          ey_ey           girl                    el   
## 6      6        want           baby           love                   run   
## 7      7        tell            way          going                  good   
## 8      8       dokie           dual        chuckie  cigarettes_cigarette   
## 9      9  young_dumb     dumb_broke        badeeya           young_young   
## 
##          Word 4     Word 5       Word 6         Word 7      Word 8  \
## 0           day   new_york         back  feelin_feelin       queen   
## 1         na_na    hey_hey        going           work   work_work   
## 2         never        say           go          would          wa   
## 3          head       well  think_crazy     crazy_kind        whoo   
## 4     wild_wild        one      hol_hol          going        good   
## 5        bam_ey     ey_bam    bam_dilla   know_hotline  bling_mean   
## 6         money        get        let_u             go         nah   
## 7           see       life        thing        nothing         one   
## 8  cocoa_butter    cointel        colin         coyote    craziest   
## 9     kill_vibe  vibe_kill       baduda  badeeya_deeya  school_kid   
## 
##         Word 9                          Word 10                 Word 11  \
## 0        never                             know                     low   
## 1        daddy                             ever              daddy_said   
## 2         baby                             time                     one   
## 3  stop_holdin                             dyin                  wanted   
## 4         feel                             back                     god   
## 5          que                               tu                      de   
## 6          gon                             girl                    take   
## 7   better_man                              low                    lost   
## 8    dank_miss                         deadbeat               delegated   
## 9   broke_high  yadadadadadadada_yadadadadadada  yadadadadadadada_young   
## 
##        Word 12       Word 13          Word 14  
## 0      wa_rare     ring_fire         remember  
## 1         uhoh           try      always_stay  
## 2          see           let             feel  
## 3     hope_die        oohooh       tryin_save  
## 4      careful           hit  drippin_finesse  
## 5           mi  country_road               bo  
## 6  murder_mind           boy  loyalty_loyalty  
## 7         back          take             wish  
## 8  chlorophyll        dougie            huggy  
## 9       remain         deeya             badu
LDA_topics(cleaned_rap_df)
## Perplexity:  -8.108736558304567
## Coherence Score:  0.28773929568832246
##    Topic      Word 0           Word 1     Word 2           Word 3 Word 4  \
## 0      0          el          monster       slay        slay_slay    que   
## 1      1        know             want       need              get   girl   
## 2      2        know             love       back               wa    boy   
## 3      3        bish        enemy_lot     bought  oohoohoohoohooh    way   
## 4      4        know             want        get             feel   need   
## 5      5        want               wa        get             know   love   
## 6      6  inside_dna           witchu  give_give             fine    dna   
## 7      7       never               go        get             make  still   
## 8      8         get  versace_versace        hit       gucci_gang     go   
## 9      9        know              get       back              man    see   
## 
##       Word 5         Word 6      Word 7  Word 8        Word 9      Word 10  \
## 0   see_hand         medusa   lifestyle    said      beginnin          top   
## 1       real            say       never    love           see         make   
## 2        get            let      let_go   still          girl          one   
## 3     strong       ohohohoh   front_gun     see          wave        drain   
## 4       time             go        love   would          make         life   
## 5      would           give        back      go           say         time   
## 6  remy_boyz  anything_give  lil_stupid  yeaaah           rot       rewind   
## 7        way            put        give    take  cocoa_butter         back   
## 8       want       new_york         put      uh          back  gon_alright   
## 9       want             go         one      wa          girl         tell   
## 
##     Word 11    Word 12       Word 13     Word 14  
## 0      baby     thugga            de        take  
## 1     money         go           bad        tell  
## 2      made      think         right     thought  
## 3  standing  duckworth  dollar_might  nah_dollar  
## 4       one      right          take         say  
## 5      life      night          make       never  
## 6   loyalty      sewed         monty         wit  
## 7      keep        say           til       could  
## 8      know       make           man       right  
## 9     going      money         never         gon
LDA_topics(cleaned_rock_df)
## Perplexity:  -7.428911702453029
## Coherence Score:  0.3866658150907866
##    Topic               Word 0         Word 1       Word 2            Word 3  \
## 0      0                 know             wa        would              love   
## 1      1              ever_wa             go  letting_day             water   
## 2      2                 make           high         want           day_die   
## 3      3                na_na            say         well          get_high   
## 4      4                  get           know           go             never   
## 5      5                   go       run_away     need_run               let   
## 6      6                 want             go          see             never   
## 7      7  save_heavydirtysoul  bennie_bennie         know  watermelon_sugar   
## 8      8                  one            say        would               let   
## 9      9                going            get        every        walk_alone   
## 
##           Word 4         Word 5         Word 6           Word 7      Word 8  \
## 0            one            see            way              let        want   
## 1  water_flowing           time  coming_coming             away         one   
## 2           life           know           give           friend    let_vibe   
## 3           baby           good           want             feel        take   
## 4            see           baby           love      another_one  going_quit   
## 5     take_money           long          could       know_daddy        make   
## 6           mine           take           love              get           u   
## 7      save_save           want         bennie  high_watermelon          go   
## 8           make          thing       hey_jude       might_also        love   
## 9           know  getting_dizzy        natural             walk          wa   
## 
##   Word 9      Word 10      Word 11      Word 12     Word 13       Word 14  
## 0  never         time         back          get          go          life  
## 1    far           wa          may  shoot_shoot  start_fire   since_world  
## 2   away  nobody_drag         back        music        said         taken  
## 3   know         said         tell        guess        look           god  
## 4  could          hey    bite_dust         said        ohoh          make  
## 5   girl        thing         take         stay        feel         never  
## 6    say  purple_rain         know       always        feel          need  
## 7   jets   sugar_high  jets_bennie    therefore         mad           say  
## 8    run   shut_mouth       better  free_fallin        feel  thinkin_much  
## 9  heart         love        never        light        full     take_hand
LDA_topics(cleaned_pop_df)
## Perplexity:  -7.4807874561033465
## Coherence Score:  0.3089222477529462
##    Topic   Word 0     Word 1         Word 2         Word 3 Word 4  \
## 0      0     know         wa           want           love  never   
## 1      1     know       make  feelin_feelin          going    get   
## 2      2     want  something           back             go    doo   
## 3      3      get       girl             go           baby  going   
## 4      4     know        one           baby           love    get   
## 5      5  hol_hol         ah      love_sent  day_christmas   true   
## 6      6     know       want            get          would    see   
## 7      7      one       want           know     want_alive   time   
## 8      8     love       want          night            way  often   
## 9      9     love       know           baby           want   life   
## 
##                 Word 5         Word 6        Word 7        Word 8  \
## 0                  get            way          good            go   
## 1                 baby          never           que          girl   
## 2  doodoodoo_doodoodoo        na_nana            na  ceiling_hold   
## 3                 take           know          want         right   
## 4                going           back         thing           cry   
## 5       partridge_pear           look  done_starboy   turtle_dove   
## 6                  say           tell          make          time   
## 7                 mind  pickin_loving     tear_left   girl_bummer   
## 8                 time      call_name          keep           run   
## 9            baby_baby          could          feel           see   
## 
##           Word 9         Word 10      Word 11     Word 12    Word 13  \
## 0            see             let         baby       would        say   
## 1            let             hot         back          tu       good   
## 2      fight_til  moment_tonight    u_ceiling   time_fore  go_higher   
## 3           back            make          see        love       body   
## 4            way        run_away          let   look_look      would   
## 5   three_french         hen_two         tree       young     thrill   
## 6            one            feel         take       could        let   
## 7  throw_tantrum        hate_hot  anthem_turn        feel     turnin   
## 8           girl            need        might  make_earth       pipe   
## 9          going            mine         knew  young_dumb       stay   
## 
##         Word 14  
## 0         could  
## 1     read_read  
## 2  power_taking  
## 3           say  
## 4       alright  
## 5          need  
## 6            go  
## 7  livin_pickin  
## 8          call  
## 9         heart
LDA_topics(cleaned_rb_df)
## Perplexity:  -7.319789081104805
## Coherence Score:  0.31427322069409414
##    Topic       Word 0         Word 1  Word 2       Word 3        Word 4  \
## 0      0     flawless  feelin_feelin     lie     god_damn           die   
## 1      1         know           baby    time          way          want   
## 2      2        often        hol_hol  mornin         turn        matter   
## 3      3   thank_next        one_two   found  look_around     work_work   
## 4      4  murder_mind      wild_wild    wild          get  wild_thought   
## 5      5         baby           want     see         feel          know   
## 6      6         love           life    back         know           one   
## 7      7         know        nothing    feel         baby            go   
## 8      8         love           know    want          get          girl   
## 9      9          one           know    make     say_name          girl   
## 
##          Word 5      Word 6       Word 7         Word 8     Word 9 Word 10  \
## 0     look_sexy         end           go              b       said   never   
## 1            wa        love          let           mind       feel     say   
## 2          side        make         girl          hello    anymore    want   
## 3  worried_bout  chandelier      hey_hey  greatest_city      three  turned   
## 4           god        king  church_wild            mob    alright    beat   
## 5          halo   heartless          way           tell  halo_halo    need   
## 6          girl        take         time       run_away         go    need   
## 7         bring        want         love           girl      going     one   
## 8           say        need           wa          never      would    baby   
## 9      new_york         let      bam_bam           life      ey_ey     run   
## 
##        Word 11      Word 12       Word 13          Word 14  
## 0        swear  cross_heart  take_element             hope  
## 1         girl        right            go              one  
## 2          say          til          time            sorry  
## 3  three_drink   corny_wish           doe  tonight_holding  
## 4        human        white        rockin             coke  
## 5         girl         time          life            light  
## 6           wa          let         could             want  
## 7          get         make          need           really  
## 8        thing          see          good             feel  
## 9        never         live          love             made
LDA_topics(cleaned_country_df)
## Perplexity:  -7.316227106159335
## Coherence Score:  0.3988602773158466
##    Topic      Word 0         Word 1         Word 2         Word 3  \
## 0      0          wa  remember_well  say_something            god   
## 1      1         red          still        nothing          would   
## 2      2        take             wa           love             go   
## 3      3        wish          never             wa           road   
## 4      4  better_man  country_music      miss_wish  jolene_jolene   
## 5      5  wild_horse     could_drag           away           wild   
## 6      6        know           time          never           love   
## 7      7    get_back             wa           love            see   
## 8      8          wa            one          texas       big_iron   
## 9      9        want           know           home            get   
## 
##         Word 4     Word 5      Word 6         Word 7        Word 8  \
## 0    wind_hair    wa_rare    remember          still     something   
## 1     old_town   ride_til  take_horse     road_going  nothing_tell   
## 2          get        one       would          right          name   
## 3        every       high        home            kid          died   
## 4    long_live       hold        know          could         still   
## 5         ride  ring_fire       thing            let    meant_baby   
## 6          say         go        away            day          turn   
## 7         make       last         man           back         thing   
## 8         town    feleena         men           made        ranger   
## 9  treacherous       time       think  follow_follow           eye   
## 
##         Word 9        Word 10        Word 11  Word 12      Word 13   Word 14  
## 0          ana       stair_wa  maybe_looking    maybe          say   nothing  
## 1  nobody_tell  taste_tequila          ridin      sky        among     youth  
## 2        going           left          never     know         time      back  
## 3      grandpa           back           time     lost         cold    cooler  
## 4       please          would    say_goodbye   become        magic        wa  
## 5   clementine           went     might_also  freedom         pain       lie  
## 6         hold           life           much     baby  always_stay     light  
## 7          say             go           ohoh     ever        alone     going  
## 8      hip_big          would       daughter    tried       outlaw  iron_hip  
## 9        slope            iii           safe    would        alone       say

Advanced Topic Modeling

BERTopic

BERTopic Description

Another approach to topic modeling is BERTopic. This is a transformer-based model that forms dense clusters of documents, giving us interpretable topics along with the words most important to each one. Transformers are convenient because they let us start from models that have already been trained and fine-tune them on our data. This matters because our datasets are often too small to train a whole model from scratch, and we generally don’t have access to GPUs powerful enough to do so.

BERTopic works by first converting the text documents into numerical embeddings using a transformer. The pre-trained embedding model is then updated and fine-tuned with the data we supply. We can use the sentence-transformers library to carry this out.

After embedding the documents with the pre-trained models, we cluster them so that documents about similar topics group together. Because clustering works poorly in high-dimensional spaces, we first reduce the dimensionality, balancing between too few dimensions (lost information) and too many (poor clustering). Once the dimensionality is lowered we can form our clusters, which end up being the topics we are looking for. Popular choices are UMAP for dimensionality reduction and HDBSCAN for forming the clusters.

After forming these clusters, the next step is to figure out what each cluster represents, which comes down to comparing the importance of words across documents. A common way to do this is TF-IDF, here in its clustered variant (c-TF-IDF). Instead of scoring each word’s importance relative to its own document, you treat all the documents in a cluster as one and score each word’s importance within that cluster. From these scores, the top 20 or so words give us a good idea of the topic we are looking at.
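To make the clustered TF-IDF idea concrete, here is a minimal pure-Python sketch. It is a simplified scoring scheme, not BERTopic’s exact c-TF-IDF formula: each cluster’s documents are merged into one “class document,” and each word is scored by its in-class frequency weighted by how rare the word is across all classes. The `clusters` data is made up for illustration.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    # clusters: dict mapping cluster id -> list of documents (strings).
    # Merge each cluster's documents into one class document and count words.
    class_counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
    # Average number of words per class, used in the IDF-like term.
    avg_words = sum(sum(cnt.values()) for cnt in class_counts.values()) / len(class_counts)
    # Total frequency of each word across all classes.
    totals = Counter()
    for cnt in class_counts.values():
        totals.update(cnt)
    # Score = in-class relative frequency * log(1 + avg class size / global frequency):
    # frequent-in-class but globally rare words score highest.
    scores = {}
    for c, cnt in class_counts.items():
        n = sum(cnt.values())
        scores[c] = {w: (f / n) * math.log(1 + avg_words / totals[w]) for w, f in cnt.items()}
    return scores

clusters = {
    0: ["love you baby", "baby love mine"],
    1: ["money cash gold", "gold chain money"],
}
scores = c_tf_idf(clusters)
```

Sorting each cluster’s scores and keeping the top few words is exactly the “top 20 or so” step described above.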

There are a couple of common issues people run into when carrying out BERTopic modeling. For example, many transformers limit the size of the documents they accept, so we might have to split documents into paragraphs, or in the case of songs into verses. Another issue is that we might end up with too many clusters, in which case we would do some topic reduction: raising min_cluster_size in HDBSCAN gives us fewer, more meaningful topics.
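The verse-splitting workaround can be sketched in a few lines of plain Python. This is an illustrative helper, not part of the tutorial’s pipeline: the 128-word limit and the blank-line verse convention are assumptions.

```python
def split_into_verses(lyrics, max_words=128):
    # Verses in raw lyrics are typically separated by blank lines; any verse
    # still longer than the limit is chopped into max_words-sized chunks so
    # every piece fits within the transformer's input size.
    chunks = []
    for verse in lyrics.split("\n\n"):
        words = verse.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

song = "first verse here\n\nsecond verse with a few more words"
```

Each chunk can then be fed to the model as its own document; topic assignments per song can be recovered afterwards by majority vote over its chunks.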

  1. Generates a representation vector for each document.
  2. UMAP algorithm reduces the dimensions that each vector has.
  3. HDBSCAN algorithm is used for the clustering process.
  4. c-TF-IDF algorithm retrieves the most relevant words for each topic.
  5. Maximize Candidate Relevance algorithm is used to maximize diversity.
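These steps map onto components that BERTopic lets you pass in explicitly. As a configuration sketch (the parameter values shown are illustrative defaults, not tuned choices, and step 1’s embedding model is left as BERTopic’s built-in sentence-transformer default):

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic

# Step 2: dimensionality reduction of the document embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
# Step 3: density-based clustering; raise min_cluster_size for fewer topics
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)
# Step 4: c-TF-IDF topic representation built on a bag-of-words
vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer()

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)
```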
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic import BERTopic
import random
import plotly.io as pio

We were able to run this code in time, but be warned: fitting BERTopic models takes a very long time.

data = cleaned_full_df.lyrics.to_list()
datasmall = data # fit on the full dataset
#datasmall = random.sample(data, 200) # uncomment to fit on a smaller random sample instead

Topic Visualization

topic_model = BERTopic() # create model with default settings
model = BERTopic(nr_topics=20) # reduce the model to (at most) 20 topics
topics, probs = model.fit_transform(datasmall) # fit model
# Inspect the top words and their c-TF-IDF scores for a few topics
model.get_topic(1)
## [('versace', 0.3466758867560014), ('comin', 0.2535267531969895), ('feel', 0.21283023797881748), ('babe', 0.1319302262124721), ('baby', 0.11843354472520784), ('dusk', 0.10451005625699201), ('dawn', 0.0972775882532528), ('till', 0.09494247304884539), ('girl', 0.06730301111162452), ('love', 0.06553494133301181)]
model.get_topic(2)
## [('get', 0.037234276978791815), ('know', 0.03353845096947296), ('want', 0.029842139035109034), ('one', 0.026727204347575355), ('go', 0.024929728120388194), ('back', 0.02324432085522547), ('see', 0.020235305205870167), ('love', 0.020062295595921306), ('let', 0.01957727594371185), ('girl', 0.01940712932636818)]
model.get_topic(3)
## [('love', 0.05332289698815026), ('know', 0.0513788165105225), ('wa', 0.04185401605081198), ('would', 0.039284996767431496), ('want', 0.0382747887939472), ('never', 0.03663786012062019), ('say', 0.031601318228263474), ('go', 0.03110966809236601), ('time', 0.03040700571206907), ('see', 0.030226845920717792)]
pio.show(model.visualize_topics()) # Inter-topic distance map
pio.show(model.visualize_barchart()) # visualize topics with top words
pio.show(model.visualize_heatmap()) # Visualize topic similarity

Extra Functions to expand dataset creation

  • With minimal work you could greatly expand the possibilities for dataset creation
  • The greatest challenge is dealing with scale: random terminations continually occur as you automate the API calls, and sometimes even the try/except/pass cheat code won’t save you
  • Genius may also suspend your access token for some time if you’re making too many API calls
  • Try to edit this code so that it uses the genius agent (genius.song(song)), i.e. makes API calls, as little as possible
  • Yet this is tricky if you want a lyric dataset, because albums can be searched with song ids, but the returned info does not include lyrics
  • Best approach so far: use top_charts() to get a dataframe, use the song ids to get the album for each song, get all song ids for each album, then finally use that final list of song ids to retrieve song_lyrics
  • So with the functions provided, first call top_charts(), then call album_songs() with the top-chart song ids, and once you’ve collected enough albums/songs call song_info() with the complete list of song ids
  • make_dataset() is an incomplete attempt at a super-function that condenses the code and automates as much as possible. It will run and begin to automate the whole process, but there is an index error: it doesn’t know when to stop and will terminate the process. This could likely be fixed in very little time.
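One way to tame those random terminations is to bound the retries and back off between attempts, rather than looping on a bare try/except/pass. Here is a sketch; call_with_retry is a hypothetical helper, not part of the tutorial’s code.

```python
import time

def call_with_retry(fn, *args, max_tries=5, wait=2, **kwargs):
    # Retry a flaky API call a bounded number of times, sleeping a little
    # longer after each failure so Genius is less likely to rate-limit us.
    for attempt in range(1, max_tries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_tries:
                raise  # give up after the final attempt
            time.sleep(wait * attempt)
```

For example, `song_info = call_with_retry(genius.song, song)` could replace the bare retry while-loops in album_songs, and raising after max_tries surfaces real failures instead of silently spinning forever.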
def album_songs(song_ids):
  # Takes in a list of song ids and returns a df of every song on each input song's album
  albums = list()
  songs = list()
  artists = list()
  new_song_ids = list()
  for song in song_ids:
    # Retry until the API call succeeds (Genius calls fail intermittently)
    while True:
      try:
        song_info = genius.song(song)
      except:
        continue
      break
    if song_info['song']['album'] is None:
      continue  # skip songs that aren't attached to an album
    album_id = song_info['song']['album']['id']
    album_name = song_info['song']['album']['name']
    album_artist = song_info['song']['artist_names']
    while True:
      try:
        album_dict = genius.album_tracks(album_id)
      except:
        continue
      break
    for track in album_dict['tracks']:
      artists.append(album_artist)
      albums.append(album_name)
      songs.append(track['song']['title'])
      new_song_ids.append(track['song']['id'])
  df = pd.DataFrame({'song': songs, 'album': albums, 'artist': artists, 'song_ids': new_song_ids})
  return df
def make_dataset(genre, time_period = 'all_time', n_per_page=20, pages=1):
  # Incomplete super-function; see the caveats above before relying on it
  df = top_charts(genre=genre, time_period=time_period, n_per_page=n_per_page, pages=pages)
  
  df = album_songs(df['song_ids'])
  
  # Remember each song id's album so it survives the song_info() call
  myDict = {k: v for k, v in zip(df['song_ids'], df['album'])}
  df = song_info(df['song_ids'])
  
  df['album'] = df['song_ids'].map(myDict)
  
  csv_name = 'topcharts_' + time_period + genre
  df.to_csv(os.path.join(os.getcwd(), csv_name + '.csv'))
  return df